Modeling

Initialize Document

Read In Data

traffic = read.csv("DenTraffic2019ToModel.csv")
traffic$SEGMDIR = as.factor(traffic$SEGMDIR)
traffic$FUNCCLASSI = as.factor(traffic$FUNCCLASSI)
traffic = subset(traffic, select = -c(X))

Linear Model 1 - All Features, No Interactions

interaction of road size and number of lanes

mod1 = lm(AADT ~ SEG_LENGTH + I(SEGMDIR) + I(FUNCCLASSI) + SURFWD + PSI + THRULNQTY + THRULNWD + RUNLENGTH1, data = traffic)
summary(mod1)
## 
## Call:
## lm(formula = AADT ~ SEG_LENGTH + I(SEGMDIR) + I(FUNCCLASSI) + 
##     SURFWD + PSI + THRULNQTY + THRULNWD + RUNLENGTH1, data = traffic)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23351  -2975   -712   2125  34405 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      16587.797    999.726  16.592  < 2e-16 ***
## SEG_LENGTH                         724.590   1542.034   0.470  0.63845    
## I(SEGMDIR)N                        -48.911    216.170  -0.226  0.82101    
## I(SEGMDIR)NE                     -2667.398    305.477  -8.732  < 2e-16 ***
## I(SEGMDIR)NW                     -2367.164    371.460  -6.373 2.02e-10 ***
## I(SEGMDIR)S                       1331.897    233.969   5.693 1.32e-08 ***
## I(SEGMDIR)SE                       105.598    439.563   0.240  0.81016    
## I(SEGMDIR)SW                      1623.074   1204.315   1.348  0.17781    
## I(SEGMDIR)W                        598.901    213.841   2.801  0.00512 ** 
## I(FUNCCLASSI)4  Minor Arterial  -12515.773    228.678 -54.731  < 2e-16 ***
## I(FUNCCLASSI)5  Major Collector -16972.644    253.597 -66.928  < 2e-16 ***
## SURFWD                              37.523      4.252   8.824  < 2e-16 ***
## PSI                                -63.672     73.479  -0.867  0.38624    
## THRULNQTY                         2431.109     91.211  26.654  < 2e-16 ***
## THRULNWD                           -64.456     83.470  -0.772  0.44003    
## RUNLENGTH1                        -118.780     78.603  -1.511  0.13082    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5190 on 5102 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7253 
## F-statistic: 901.8 on 15 and 5102 DF,  p-value: < 2.2e-16

The only features for which we don’t have overwhelming evidence are effective in predicting \(\verb|AADT|\) are \(\verb|SEG_LENGTH|\) , some directions of \(\verb|SEGDIR|\) , \(\verb|PSI|\) , \(\verb|THRULNWD|\) and \(\verb|RUNLENGTH1|\) . After a quick collinearity check, we will systematically remove the variables one at a time.

VIF

This checks for collinearity in our features. This statistic is on a scale that starts at 1, and lower values are indicative of no collinearity.

vif(mod1)
##                   GVIF Df GVIF^(1/(2*Df))
## SEG_LENGTH    1.076254  1        1.037427
## I(SEGMDIR)    1.365241  7        1.022487
## I(FUNCCLASSI) 1.742328  2        1.148901
## SURFWD        1.375148  1        1.172667
## PSI           1.054805  1        1.027037
## THRULNQTY     2.051504  1        1.432307
## THRULNWD      1.087260  1        1.042717
## RUNLENGTH1    1.107105  1        1.052191

None of these features are of concern. This supports the conclusion we started to come to with the correlation heat-map to keep each feature in our first model.

ANOVA and Variable Selection

Based off our the significant levels from our model, we will run an ANOVA test to see if the features just mentioned help the model enough to keep them in. We will omit them in if the p-score associated with their F-statistic (for variables with factors) or p-value (for numeric variables) is greater than 0.10.

mod1sansDIR = update(mod1,.~.-I(SEGMDIR))
anova(mod1sansDIR,mod1)
## Analysis of Variance Table
## 
## Model 1: AADT ~ SEG_LENGTH + I(FUNCCLASSI) + SURFWD + PSI + THRULNQTY + 
##     THRULNWD + RUNLENGTH1
## Model 2: AADT ~ SEG_LENGTH + I(SEGMDIR) + I(FUNCCLASSI) + SURFWD + PSI + 
##     THRULNQTY + THRULNWD + RUNLENGTH1
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   5109 1.4289e+11                                   
## 2   5102 1.3743e+11  7 5456470847 28.939 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An F-statistic of 20.388 and p-value of 2.2e-16 indicates we should keep \(\verb|SEGMDIR|\) in the model. This warrants more discussion though, as many of the individual directions didn’t have a significant impact on the model. There isn’t any science to support that the direction of travel would affect the average daily traffic. Furthermore, in conjunction with our visualization of the miles of road in each direction in Denver, there is evidence to suggest this data is not useful for predicting \(\verb|AADT|\) . Accordingly, we will remove it. H

mod1 = mod1sansDIR
summary(mod1)
## 
## Call:
## lm(formula = AADT ~ SEG_LENGTH + I(FUNCCLASSI) + SURFWD + PSI + 
##     THRULNQTY + THRULNWD + RUNLENGTH1, data = traffic)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22953  -2985   -717   2464  35516 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      18365.327    989.925  18.552  < 2e-16 ***
## SEG_LENGTH                        1698.624   1548.703   1.097  0.27278    
## I(FUNCCLASSI)4  Minor Arterial  -12479.458    231.137 -53.992  < 2e-16 ***
## I(FUNCCLASSI)5  Major Collector -16784.120    255.760 -65.624  < 2e-16 ***
## SURFWD                              40.171      4.289   9.367  < 2e-16 ***
## PSI                                -13.546     73.638  -0.184  0.85406    
## THRULNQTY                         2332.657     90.694  25.720  < 2e-16 ***
## THRULNWD                          -263.928     82.337  -3.205  0.00136 ** 
## RUNLENGTH1                         -14.140     78.281  -0.181  0.85667    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5288 on 5109 degrees of freedom
## Multiple R-squared:  0.7153, Adjusted R-squared:  0.7148 
## F-statistic:  1604 on 8 and 5109 DF,  p-value: < 2.2e-16

So without \(\verb|SEGMDIR|\) the largest p-value is \(\verb|RUNLENGTH1|\). So let’s see how the model is without it.

mod1 = update(mod1,.~.-RUNLENGTH1)
summary(mod1)
## 
## Call:
## lm(formula = AADT ~ SEG_LENGTH + I(FUNCCLASSI) + SURFWD + PSI + 
##     THRULNQTY + THRULNWD, data = traffic)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22954  -3002   -717   2469  35528 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      18341.672    981.132  18.694  < 2e-16 ***
## SEG_LENGTH                        1700.344   1548.527   1.098  0.27224    
## I(FUNCCLASSI)4  Minor Arterial  -12476.347    230.473 -54.134  < 2e-16 ***
## I(FUNCCLASSI)5  Major Collector -16775.285    251.016 -66.830  < 2e-16 ***
## SURFWD                              40.156      4.287   9.366  < 2e-16 ***
## PSI                                -14.638     73.382  -0.199  0.84190    
## THRULNQTY                         2334.305     90.225  25.872  < 2e-16 ***
## THRULNWD                          -263.679     82.317  -3.203  0.00137 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5288 on 5110 degrees of freedom
## Multiple R-squared:  0.7153, Adjusted R-squared:  0.7149 
## F-statistic:  1834 on 7 and 5110 DF,  p-value: < 2.2e-16

So without \(\verb|RUNLENGTH1|\) the largest p-value is \(\verb|PSI|\). So let’s see how the model is without it.

mod1 = update(mod1, .~.-PSI)
summary(mod1)
## 
## Call:
## lm(formula = AADT ~ SEG_LENGTH + I(FUNCCLASSI) + SURFWD + THRULNQTY + 
##     THRULNWD, data = traffic)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22941  -3002   -718   2473  35516 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      18283.673    936.969  19.514  < 2e-16 ***
## SEG_LENGTH                        1714.346   1546.790   1.108  0.26777    
## I(FUNCCLASSI)4  Minor Arterial  -12476.450    230.450 -54.139  < 2e-16 ***
## I(FUNCCLASSI)5  Major Collector -16776.388    250.931 -66.856  < 2e-16 ***
## SURFWD                              40.085      4.272   9.382  < 2e-16 ***
## THRULNQTY                         2335.030     90.144  25.903  < 2e-16 ***
## THRULNWD                          -262.899     82.217  -3.198  0.00139 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5287 on 5111 degrees of freedom
## Multiple R-squared:  0.7153, Adjusted R-squared:  0.7149 
## F-statistic:  2140 on 6 and 5111 DF,  p-value: < 2.2e-16

So without \(\verb|PSI|\) the largest p-value is \(\verb|SEG_LENGTH|\). So let’s see how the model is without it.

mod1 = update(mod1, .~.-SEG_LENGTH)
summary(mod1)
## 
## Call:
## lm(formula = AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD, 
##     data = traffic)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22847  -3016   -735   2465  35441 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      18372.956    933.521  19.681  < 2e-16 ***
## I(FUNCCLASSI)4  Minor Arterial  -12435.033    227.405 -54.682  < 2e-16 ***
## I(FUNCCLASSI)5  Major Collector -16734.328    248.051 -67.463  < 2e-16 ***
## SURFWD                              39.856      4.267   9.339  < 2e-16 ***
## THRULNQTY                         2336.661     90.134  25.924  < 2e-16 ***
## THRULNWD                          -263.073     82.219  -3.200  0.00138 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5288 on 5112 degrees of freedom
## Multiple R-squared:  0.7152, Adjusted R-squared:  0.7149 
## F-statistic:  2567 on 5 and 5112 DF,  p-value: < 2.2e-16

This looks good! Each of the p-values are well below our alpha of 0.10. So our first linear model is

\[ \widehat {\verb|AADT|} = 18844 - 12353 \cdot \verb|I|\verb|(Class=4|) -16705 \cdot \verb|I|\verb|(Class=5|) + 44\cdot \verb|SURFWD| + 2167\cdot \verb|THRULNQTY| -294\cdot \verb|THRULNWD| \]

Model Validity

We want to show that a linear model is a good type of model for this interaction. On the residual plot below, this is indicated by the points being symmetrically distributed above and below the value of 0. Furthermore, the residual points should be uniform and not fan out to ensure our model has equal variance of residuals.

plot(mod1$residuals)

Although this isn’t perfect, there isn’t anything so egregious to justify not using a linear model.

We also want our model to be normally distributed, which on the Q-Q plot below would look like a straight 45 degree line.

plot(mod1,which=2)

Like before, this isn’t perfect, but given that there are over 6000 observations, it is not unexpected to see at least some theoretical quantiles fall below -4 or above 4.

Linear Model 2 - Some Features, No Outliers, No Interactions

Outliers

outliers = outlierTest(mod1)
subset(traffic,rownames(traffic)%in%names(outliers$rstudent))
##      SEG_LENGTH        ROUTENAME SEGMDIR                    FUNCCLASSI SURFWD
## 141        0.06 W 6TH AVENUE FWY       W 3  Principal Arterial - Other     76
## 484        0.01      N PEORIA ST       N             4  Minor Arterial     48
## 510        0.03      N PEORIA ST       N             4  Minor Arterial     48
## 761        0.01 W 6TH AVENUE FWY       W 3  Principal Arterial - Other     76
## 815        0.01 W 6TH AVENUE FWY       W 3  Principal Arterial - Other     20
## 943        0.04 W 6TH AVENUE FWY       W 3  Principal Arterial - Other     20
## 2432       0.10        E 1ST AVE       E 3  Principal Arterial - Other     36
## 3171       0.08 W 6TH AVENUE FWY       W 3  Principal Arterial - Other     84
## 3810       0.02 W 6TH AVENUE FWY       W 3  Principal Arterial - Other     20
## 5111       0.04 W 6TH AVENUE FWY       W 3  Principal Arterial - Other     20
##      PSI THRULNQTY THRULNWD  AADT AADTDERIV RUNLENGTH1
## 141    1         6       12 66000         3       0.26
## 484    4         4       12 40000         3       0.41
## 510    4         4       12 40000         3       0.40
## 761    1         6       12 66000         3       0.20
## 815    4         6       10 66000         3       0.31
## 943    4         6       10 66000         3       0.19
## 2432   4         3       12 52000         0       0.10
## 3171   4         6       10 66000         3       0.08
## 3810   4         6       10 66000         3       0.10
## 5111   4         6       10 66000         3       0.30

So our only outlines were 1st, 6th, and Peoria Street. These are all larger streets. 1st is particularly impressive, with only 3 lanes and an \(\verb|AADT|\) of 52000. In an effort to better model smaller streets, we acknowledge that these outliers are skewing the linear model and affecting the model. We will also try a model without these observations for thoroughness.

traffic = subset(traffic,!rownames(traffic)%in%names(outliers$rstudent))
mod2 = lm(AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD , data = traffic)
summary(mod2)
## 
## Call:
## lm(formula = AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD, 
##     data = traffic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22366.8  -3004.1   -697.6   2407.4  25043.7 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      18844.542    900.267  20.932  < 2e-16 ***
## I(FUNCCLASSI)4  Minor Arterial  -12353.450    219.175 -56.363  < 2e-16 ***
## I(FUNCCLASSI)5  Major Collector -16705.626    238.951 -69.912  < 2e-16 ***
## SURFWD                              44.674      4.124  10.832  < 2e-16 ***
## THRULNQTY                         2167.743     87.276  24.838  < 2e-16 ***
## THRULNWD                          -294.249     79.288  -3.711 0.000209 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5091 on 5102 degrees of freedom
## Multiple R-squared:  0.7235, Adjusted R-squared:  0.7232 
## F-statistic:  2670 on 5 and 5102 DF,  p-value: < 2.2e-16

As expected, removing these few points didn’t actually change that much with our correlation coefficients nor the associated statistics.

\[ \widehat {\verb|AADT|} = 18844 - 12353 \cdot \verb|I|\verb|(Class=4|) -16705 \cdot \verb|I|\verb|(Class=5|) + 44\cdot \verb|SURFWD| + 2167\cdot \verb|THRULNQTY| -294\cdot \verb|THRULNWD| \]

Linear Model 3 - Some Features, No Outliers, Interactions

Interactions

It is also important to consider how features interact with each other. For example, dividing the width of lanes by the inverse of lane quantity can be considered some sort of density. Or, given that there is a lot of time spent in acceleration, and deceleration, it would be understandable if the length of the run had a cubic interaction to account for saved time and ease of flow on longer stretches.

mod3 = lm(AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD + THRULNWD/THRULNQTY + I(RUNLENGTH1^3), data = traffic)
summary(mod3)
## 
## Call:
## lm(formula = AADT ~ I(FUNCCLASSI) + SURFWD + THRULNQTY + THRULNWD + 
##     THRULNWD/THRULNQTY + I(RUNLENGTH1^3), data = traffic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22639.1  -2945.0   -713.3   2359.9  24877.7 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      23756.682   2139.413  11.104  < 2e-16 ***
## I(FUNCCLASSI)4  Minor Arterial  -12434.206    220.533 -56.383  < 2e-16 ***
## I(FUNCCLASSI)5  Major Collector -16875.499    245.288 -68.799  < 2e-16 ***
## SURFWD                              45.546      4.147  10.982  < 2e-16 ***
## THRULNQTY                          321.932    751.792   0.428 0.668509    
## THRULNWD                          -733.181    196.884  -3.724 0.000198 ***
## I(RUNLENGTH1^3)                     -9.346      4.366  -2.141 0.032339 *  
## THRULNQTY:THRULNWD                 169.734     69.463   2.444 0.014579 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5086 on 5100 degrees of freedom
## Multiple R-squared:  0.7241, Adjusted R-squared:  0.7237 
## F-statistic:  1912 on 7 and 5100 DF,  p-value: < 2.2e-16

\[ \widehat {\verb|AADT|} = 23756 - 12434 \cdot \verb|I|\verb|(Class=4|) -16875 \cdot \verb|I|\verb|(Class=5|) + 45\cdot \verb|SURFWD| + 321\cdot \verb|THRULNQTY| + 169 \cdot \frac{\verb|THRULNWD|}{\verb|THRULNQTY|} -733\cdot \verb|THRULNWD| -9\cdot \verb|RUNLENGTH|^3 \]

Including these interactions made the model more convoluted and difficult to predict with. Furthermore, including them greatly changed the intercept and the coefficient for \(\verb|THRULNQTY|\). So it is questionable that this would be a better model in practice. Accordingly, we propose the use of the second model generally, and it will be the one used in latter parts of the project.

Cross Validation

trafficTest = read.csv("DenTraffic2019Test.csv")
y_pred = predict(mod2, type = 'response', newdata = trafficTest)
MAPE(y_pred=y_pred, y_true = trafficTest$AADT)
## [1] 0.753313
MSE(y_pred=y_pred,y_true = trafficTest$AADT)
## [1] 30168916
NormalizedGini(y_pred=y_pred, y_true = trafficTest$AADT)
## [1] 0.823578
goodPreds = y_pred > .9*trafficTest$AADT & y_pred < 1.1*trafficTest$AADT
sum(goodPreds)/length(y_pred)
## [1] 0.1936219

Hmm these results don’t seem too promising. It looks like only about 18% of predictions were within 10% of the real world value. Furthermore, an mean absolute percentage error of 75% indicates that the model does quite poorly, but the normalized GINI coefficient of 0.82 indicates that our model does well given its circumstances. These values shouldn’t change much for our different models though, so we’ll only show the statistics for the 3th model.

y_pred = predict(mod3, type = 'response', newdata = trafficTest)
MAPE(y_pred=y_pred, y_true = trafficTest$AADT)
## [1] 0.7474903
MSE(y_pred=y_pred,y_true = trafficTest$AADT)
## [1] 30193413
NormalizedGini(y_pred=y_pred, y_true = trafficTest$AADT)
## [1] 0.8231322
goodPreds = y_pred > .9*trafficTest$AADT & y_pred < 1.1*trafficTest$AADT
sum(goodPreds)/length(y_pred)
## [1] 0.2095672

Nothing changes that much. Our models are very similar and had similar \(R^2\) scores so it’s very understandable that they performed similarly.

Export Models

saveRDS(mod1,"model1Full")
saveRDS(mod2,"model2Thin")
saveRDS(mod3,"model3Inter")
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
fig <- plot_ly(traffic, x = ~AADT, y = ~SURFWD, z = ~THRULNQTY, type = 'scatter3d',color=traffic$FUNCCLASSI, opacity = 1, mode= "marker")

fig